Project Overview

Motivation

Bikesharing has become more and more popular in major cities in the United States “Americans are falling in love with bikeshare”. The Bluebikes is a Metro Boston’s public bike share program, with over 1,800 bikes and more than 200 stations across Boston, Brookline, Cambridge, and Somerville. Bluebike operates under the customer and subscriber systems. In 2017, the total number of trips is 1,313,837, and the number of subscribers is 14,577. For the individual customer, the single trip pass costs 2.5 dollars for 30 mins ride, with 2.5 dollars for an additional 30 mins. For the subscriber, Bluebike requires a membership fee of 99 dollars per year and the subscriber can have unlimited rides, with each ride under 45 mins. For any additional 30 mins post 45 mins ride, the subscriber would need to pay $2.5.

This graph shows that the number of Bluebike subscribers and customers had increased dramatically from 2011 to 2013 and has maintained the high number of riders until 2017.

This line chart shows the total number of the Bluebike trips, which could count multiple rides of one user. The number of trips also increased in the first part of the years as the number of users did. On the contrary to the number of riders, the trip counts have still increased in these few years.

Objectives

For this project, we would like to understand

  • What are the general demographics of bike riders?
  • How does the frequency of bike riding change with time, day and month of the year?
  • Does weather affect bike riding?
  • Given Bluebike only collects information on subscribers, how can we predict the characteristics and conditions for overtime users?

Approach

Starting with web scrapping we collected datasets from Boston Hubway and weather sites, focusing on the year 2017. We merged these datasets into a single csv file for our subsequent analysis. We performed exploratory analysis of bluebike riders in Boston, with the hopes of trying to have a general idea of who a typical Blue bike subscriber was, and how their riding pattern changed with changing weather. We then drilled down on bluebike subscribers who ride overtime, meaning more than 45 mins. Since these people pay an extra $2.5 per 30mins, and thus make bluebike company extra revenue. This informed our subsequent analysis, using multiple logistic regression techniques and Decision tree analysis. We tried to predict the typical oversubscriber and what day of the week such a rider would be out and about.

Data Preparation

# Importing the all Bluebike data in 2017
dat_201701<-read.csv("201701-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201702<-read.csv("201702-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201703<-read.csv("201703-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201704<-read.csv("201704-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201705<-read.csv("201705-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201706<-read.csv("201706-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201707<-read.csv("201707-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201708<-read.csv("201708-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201709<-read.csv("201709-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201710<-read.csv("201710-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201711<-read.csv("201711-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201712<-read.csv("201712-hubway-tripdata.csv",stringsAsFactors = FALSE)

# Combining the data
dat_2017 <- rbind(dat_201701, dat_201702, dat_201703, dat_201704, dat_201705,
                  dat_201706, dat_201707, dat_201708, dat_201709, dat_201710,
                  dat_201711, dat_201712)

Adding variables

# age, age_cat, duration_min, year, month, month_abb, day, hour, wday
bbike <- dat_2017 %>%
  mutate(birth.year = as.numeric(birth.year)) %>%
        mutate(age = 2017 - birth.year) %>%
  mutate(age_cat = case_when(
    .$age >= 10 & .$age < 20 ~ 1,
    .$age >= 20 & .$age < 30 ~ 2,
    .$age >= 30 & .$age < 40 ~ 3,
    .$age >= 40 & .$age < 50 ~ 4,
    .$age >= 50 & .$age < 60 ~ 5,
    .$age >= 60 & .$age < 70 ~ 6,
    .$age >= 70 & .$age < 80 ~ 7,
    .$age >= 80 ~ 8)) %>%
  mutate(duration_min = tripduration / 60) %>%
  mutate(year = year(starttime), 
         month = month(starttime),
         month_abb = month(starttime, label = TRUE, abbr = TRUE), 
         day = day(starttime),
         hour = hour(starttime), 
         wday = wday(starttime, label = TRUE, abbr = TRUE))
# dist_km: Trip distance (km)
setDT(bbike)[ , dist_km := distGeo(matrix(c(start.station.longitude, start.station.latitude), ncol = 2),matrix(c(end.station.longitude, end.station.latitude), ncol = 2))/1000]
bbike <- as.data.frame(bbike)
# overtime
bbike <- bbike %>%
  mutate(overtime = ifelse(duration_min > 45, 1, 0))
# user_start, user_end: number of users at the start/end station
bbike <- bbike %>%
  group_by(start.station.id) %>%
  mutate(user_start = n())

bbike <- bbike %>%
  group_by(end.station.id) %>%
  mutate(user_end = n())

Combining weather information

# temp_max, temp_min, rain, snownice(snow or ice)
weather <- read_excel("boston_weather.xls")
bbike <- bbike %>%
  group_by(year, month, day) %>%
  left_join(., weather, by = c("year", "month", "day")) 

Analysis

Number of trips by gender and age

bbike %>%
  mutate(gender = as.factor(gender)) %>%
  filter(gender != "0") %>%  #1 is male, 2 is female, 0 is unknown
  ggplot(aes(factor(age_cat))) +
  geom_bar(aes(fill = gender)) +
  scale_x_discrete("Age", 
                   labels = c('10s','20s','30s','40s','50s','60s','70s','80+')) +
  scale_fill_manual("Gender", values = c("#2CA3E1", "#FF8F1C"), labels = c("Male", "Female")) +
  scale_y_continuous("Number of Rides") +
  ggtitle("Total Number of Trips by Gender and Age") +
  theme_bw()

In this bar chart, people with gender-code “0” were removed from the analysis. Age was stratified into bands of 10 years in width and shows as categorical variable for comparison. Main Blue Bike users are 20s or 30s, and male users are dominant in each generation.

Trip distance

# Distance
bbike %>% 
  filter(birth.year > 1900 & birth.year <= 2017)%>%
  group_by(age_cat) %>% 
  summarize(avg = mean(dist_km), se = sd(dist_km) / sqrt(n())) %>% 
  ggplot(aes(age_cat, avg))+ 
  geom_errorbar(aes(ymin = avg - 2*se, ymax = avg+2 * se), width = 0.2, alpha = 0.5)+
  geom_point(color = "#FF8F1C")+
  geom_line(color = "#1D428A")+
  scale_x_continuous(breaks=(c(1,2,3,4,5,6,7,8)), 
                     labels=c("10s","20s","30s","40s","50s","60s","70s","80+"))+
  xlab(expression(paste(Age, " (years)")))+
  ylab(expression(paste(Distance," (km)"))) +
  ggtitle("Trip Distance and Age") +
  theme_bw()

Here there’s an initial upward trend we see with age and distance covered in kilometers. As age increased from the teens to about age 50, the distance covered also increased. This positive linear relationship was at age 50, after which we began to see a decline. Interestingly, there was another upward trend after age 80.

Blue bike Stations

# Create data frame
dat_station<- read.csv("Hubway_Stations_as_of_July_2017.csv")
dat_2017_station <- bbike %>%
filter(!birth.year=="\\N")%>%
                        filter(birth.year>1900 & birth.year<=2017)%>%
  group_by(start.station.id) %>% summarize(number = n(), start.station.latitude=first(start.station.latitude), start.station.longitude=first(start.station.longitude),start.station.name=first(start.station.name))%>% filter(start.station.latitude>0)
summary(dat_2017_station)
##  start.station.id     number      start.station.latitude
##  Min.   :  1.00   Min.   :    4   Min.   :42.30         
##  1st Qu.: 54.75   1st Qu.: 1794   1st Qu.:42.34         
##  Median :108.50   Median : 4750   Median :42.36         
##  Mean   :111.91   Mean   : 5625   Mean   :42.36         
##  3rd Qu.:173.25   3rd Qu.: 7614   3rd Qu.:42.37         
##  Max.   :232.00   Max.   :35702   Max.   :42.41         
##  start.station.longitude start.station.name
##  Min.   :-71.17          Length:196        
##  1st Qu.:-71.11          Class :character  
##  Median :-71.08          Mode  :character  
##  Mean   :-71.09                            
##  3rd Qu.:-71.06                            
##  Max.   :-71.01

Popularity of Bluebikes stations in 2017

# Distinguish stations by color based on the number of users
getColor <- function(df){
  sapply(df$number, function(number) {
  if(number %in% 1:4000) {
    "green"
  } else if(number %in% 4001:10000) {
    "orange"
  } else if(number >10000) {
    "red"
  } else {
    "blue"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(dat_2017_station)
)

leaflet(dat_2017_station) %>% addTiles() %>%
  addAwesomeMarkers(~start.station.longitude
, ~start.station.latitude, icon=icons, label=~as.character(number), popup = ~start.station.name)

This map classified Blue Bike stations into three categories based on the number of trips in 2017: the red dots represent Bluebike stations that were used the most, and the green dots represent the stations used the least. The most popular station was “MIT at Mass Ave / Amherst St”, at the number of trips 35,700 per year. The least popular station was “Four Corners - 157 Washington St”, at the number of trips just four per year.

Number of trips in each hour within a day through Sunday to Saturday

bbike %>% ggplot(aes(hour))+
  geom_bar(fill = "#006AC6", color = "#0050B5")+
  scale_x_continuous(breaks=(c(0,4,8,12,16,20)))+
  xlab("Time")+
  ylab("Number of Users") +
  ggtitle("Trips Related to Time and the Day of Week") +
  facet_wrap(~ wday) +
  theme_fivethirtyeight()

These bar graphs show the number of trips in each hour within a day through Sunday to Saturday. During weekdays, the usage of Bluebike peaked during commute hours (7-9am and 4-6pm).During weekends, the distribution is more like a bell curve, with peak times between 12pm and 4pm

bbike %>% 
  filter(gender != 0) %>%
  ggplot(aes(hour)) +
  geom_bar(aes(fill = factor(gender))) +
  scale_x_continuous(breaks=(c(0,4,8,12,16,20))) +
  scale_fill_manual("Gender", values = c("#0090DA", "#FF8F1C"), 
                      labels = c("Male", "Female")) +
  xlab("Time")+
  ylab("Number of users") +
  ggtitle("Monthly Trend of Trips") +
  facet_wrap(~ month) +
  theme_economist_white()

The figure shows seasonal comparison of users throughout a year. Hint! It’s Boston! As expected, the amount of rides decreased dramatically from January to March. Men used Bluebikes much frequently across hours and months.

Distribution of Trip Duration

bbike_member <- bbike %>%
  filter(usertype == "Subscriber")
summary(bbike_member$duration_min)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.10     9.78    13.31    15.48 61276.80

As you can see, the data seems to have wrong information. The very long tripduration might be attributable to lost or other errors. Therefore, we limit the range from 0 to 75 minutes for the duration in this study.

bbike_member <- bbike_member %>%
  filter(duration_min < 50)
bbike_member %>%
  mutate(group = ifelse(rain == 0, "no rain", "rain")) %>%
  ggplot(aes(duration_min, y = ..count.., fill = group)) +
  geom_density(alpha = 0.2) +
scale_fill_discrete("Rain", labels = c("No rain", "Rain")) +
  xlab("Trip Duration (min)") +
  ylab("Riders") +
  ggtitle("Trip Duration and Rain") +
  theme_bw()

Since the maximum trip duration 61276.80 minutes, we decided to cap our analysis at 75mins. As a bike-leasing organization, a very long trip duration might probably be due to a missing bike or some unknown reason. Unsurprisingly, we see that many riders use the bikes when the weather is more conducive. The number of blue-bike users when it is not raining is more than double the number of people who use the bikes in the rain.

bbike_member %>%
  filter(gender != 0) %>%
  mutate(gender = as.factor(gender)) %>%
  ggplot(aes(duration_min, y = ..count.., fill = gender)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual("Gender", values = c("#0090DA", "#FF8F1C"),
                    labels = c("Male", "Female")) +
  xlab("Trip Duration (min)") +
  ylab("Riders") +
  ggtitle("Trip Duation and Gender") +
  theme_bw()

This gender difference was however not very obvious after about 40mins. Again we can infer from this graphic that, gender played little role among over time riders.

Prediction

# Keep only the subscribers in the dataset 
bbikemember <- bbike %>%
  ungroup() %>%
  filter(usertype == "Subscriber" & duration_min < 75) %>%
  mutate(gender = as.factor(gender), day = as.factor(day),
         hour = as.factor(hour)) %>%
  mutate(rain_cat = ifelse(rain == 0, 0, 1)) %>%
  mutate(snownice_cat = ifelse(snownice == 0, 0, 1)) %>%
  mutate(overtime = ifelse(duration_min < 15, 0, 1)) %>%
  mutate(overtime = as.factor(overtime)) %>%
  mutate(satsun = ifelse(wday %in% c("Sat", "Sun"), 1, 0)) %>%
  mutate(satsun = as.factor(satsun)) %>%
  filter(gender != 0)

Given Bluebike only collects information on subscribers and there is no User ID available in the dataset, we would like to know the characteristics and conditions for overtime users, as overtime users would provide additional revenue to Bluebike.

Decision tree

output_tree1 <- ctree(overtime ~ gender + satsun,
                     data = bbikemember)
plot(output_tree1)

The decision tree analysis shows us that when the rider is female and the riding time is over the weekend, the subscriber tends to ride overtime. When the rider is male and the riding time is over weekdays, the subscriber has the least probability to ride overtime.

set.seed(6)
s <- sample(1:nrow(bbikemember), 500000)

dt_train = bbikemember[s,]
dt_test = bbikemember[-s,]

dtm = ctree(overtime ~ gender + factor(rain_cat) + factor(snownice_cat), data = dt_train)

plot(dtm)

dt_test$predClass = predict(dtm, newdata = dt_test, type = "response") 
dt_test$predProb = sapply(predict(dtm, newdata = dt_test, type= "prob"),'[[',2) 

dt_test$predClass2 = 0
dt_test$predClass2[dt_test$predProb >= 0.3] = 1

table(dt_test$predClass2, dt_test$overtime)
##    
##          0      1
##   0 340631 110554
##   1  97432  45622

The decision tree analysis shows us that when the rider is female and there is no rain, the subscriber tends to ride overtime. When the rider is male and there is both rain and snow, the subscriber has the least probability to ride overtime.

roc_pred <- prediction(dt_test$predProb, dt_test$overtime)
pf <- performance(roc_pred, "tpr", "fpr")
plot(pf, col = "#0050B5")
abline(0, 1, col="grey")

# getting area under the curve
performance(roc_pred,"auc")@y.values
## [[1]]
## [1] 0.5428584

However, after checking the performance of the tree model, we figured out that the model did not seem to predict well because the AUC is 0.5428584. To find more accurate model, we proceeded to the logistic regression by putting more predictors in our new model.

logistic regression

set.seed(1)

Train <- createDataPartition(bbikemember$overtime, p = 0.5, list = FALSE)
training <- bbikemember[ Train, ]
testing <- bbikemember[ -Train, ]

#overall accuracy 
p = 0.358  #290235/810623
y_hat <- sample(c("0","1"), length(testing), replace = TRUE, prob=c(p, 1-p)) %>% 
  factor(levels = levels(bbikemember$overtime))
mean(y_hat == testing$overtime)
## [1] 0.4657506
#logistic regression
glm.fit <- training %>% 
  glm(overtime ~ gender + age + month + satsun + rain_cat +
               snownice_cat + user_start + user_end, data=., family = "binomial")

summary(glm.fit)
## 
## Call:
## glm(formula = overtime ~ gender + age + month + satsun + rain_cat + 
##     snownice_cat + user_start + user_end, family = "binomial", 
##     data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3354  -0.8276  -0.7025   1.3495   2.7237  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.370e-01  1.620e-02 -45.484  < 2e-16 ***
## gender2       3.173e-01  7.022e-03  45.180  < 2e-16 ***
## age           7.089e-03  2.714e-04  26.115  < 2e-16 ***
## month         5.586e-03  1.246e-03   4.482  7.4e-06 ***
## satsun1       2.488e-01  7.862e-03  31.645  < 2e-16 ***
## rain_cat     -8.604e-02  6.848e-03 -12.564  < 2e-16 ***
## snownice_cat -6.166e-01  4.143e-02 -14.884  < 2e-16 ***
## user_start   -2.917e-05  4.184e-07 -69.709  < 2e-16 ***
## user_end     -2.724e-05  3.893e-07 -69.985  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 630108  on 547119  degrees of freedom
## Residual deviance: 611999  on 547111  degrees of freedom
## AIC: 612017
## 
## Number of Fisher Scoring iterations: 4
exp(cbind(OR=coef(glm.fit), confint.default(glm.fit)))
##                     OR     2.5 %    97.5 %
## (Intercept)  0.4785536 0.4635945 0.4939954
## gender2      1.3733690 1.3545963 1.3924017
## age          1.0071142 1.0065785 1.0076502
## month        1.0056015 1.0031481 1.0080609
## satsun1      1.2824760 1.2628654 1.3023912
## rain_cat     0.9175533 0.9053197 0.9299521
## snownice_cat 0.5397628 0.4976664 0.5854201
## user_start   0.9999708 0.9999700 0.9999717
## user_end     0.9999728 0.9999720 0.9999735
#prediction
p_hat <- predict(glm.fit, newdata=testing, type="response")
y_hat <- ifelse(p_hat > 0.5, 1, 0) %>% factor()

#confusion matrix 
table(predicted = y_hat, actual = testing$overtime)
##          actual
## predicted      0      1
##         0 403056 143450
##         1    346    267
confusionMatrix(data = factor(y_hat), reference = factor(testing$overtime))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 403056 143450
##          1    346    267
##                                          
##                Accuracy : 0.7372         
##                  95% CI : (0.736, 0.7383)
##     No Information Rate : 0.7373         
##     P-Value [Acc > NIR] : 0.5966         
##                                          
##                   Kappa : 0.0015         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.999142       
##             Specificity : 0.001858       
##          Pos Pred Value : 0.737514       
##          Neg Pred Value : 0.435563       
##              Prevalence : 0.737320       
##          Detection Rate : 0.736688       
##    Detection Prevalence : 0.998880       
##       Balanced Accuracy : 0.500500       
##                                          
##        'Positive' Class : 0              
## 

ROC

model<-training %>%
glm(overtime ~ gender + age + month + satsun + rain_cat +
             snownice_cat + user_start + user_end, data=., family = "binomial")

predpr <- predict(model, newdata=testing, type="response")
roccurve<- roc(testing$overtime~predpr)
roccurve
## 
## Call:
## roc.formula(formula = testing$overtime ~ predpr)
## 
## Data: predpr in 403402 controls (testing$overtime 0) < 143717 cases (testing$overtime 1).
## Area under the curve: 0.6119
predict <- predict(model, newdata=testing, type = "response")

ROCRpred <- prediction(predict, testing$overtime)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, col = "#FF8F1C", text.adj = c(-0.2,1.7))
abline(0, 1, col = "grey")

In this logistic regression model, we put gender, age, weekends (versus weekdays), rain, snow, the popularity of starting stations, and the popularity of ending stations to predict overtime riders. We created training dataset for building a model and testing dataset for check the performance of the model. Because of the big sample size, we got statistically higher or lower odds of over-riding in all of the predictors. However, if we look at the odds ratio, we could realize something interesting. The odds of over-riding when it snowed is 0.54 (95% CI: 0.50 to 0.59) times the odds when it did not. The odds of over-riding when it rained is 0.92 (95% CI: 0.91 to 0.93) times the odds when it did not. If it rains or snows, people tend to return bike on time. On the other hand, the odds of over-riding on weekends is 1.28 (95% CI: 1.26 to 1.30) times the odds on weekdays. The odds of over-riding among females is 1.37 (95% CI: 1.35 to 1.39) times the odds among males. The logistic regression shows similar result, as when it snows or rains, users tend to return bike on time. Similarly, females tend to use the Bluebikes longer than 45 minutes especially during weekends. Finally we drew the ROC curve using the testing dataset. However, AUC is 0.61, which means this model might be better than our previous model but still not so good in terms of predicting over-ride.